Skip to content

Guard *scanf() return type extension by counter#5641

Draft
hakre wants to merge 10 commits into
phpstan:2.1.xfrom
hakre:patch-1
Draft

Guard *scanf() return type extension by counter#5641
hakre wants to merge 10 commits into
phpstan:2.1.xfrom
hakre:patch-1

Conversation

@hakre
Copy link
Copy Markdown
Contributor

@hakre hakre commented May 11, 2026

Use the recently fixed PrintfHelper::getScanfPlaceholdersCount() to detect invalid format strings in the return type extension. If the format is uncountable, the call is guaranteed to fail – return NullType immediately.

This is the same approach that eliminated the count regression. It already fixes all false‑positive type inferences for malformed formats.

For example, invalid format (mixing positional %n$ with sequential %) returns null.

@staabm
Copy link
Copy Markdown
Contributor

staabm commented May 11, 2026

can you come up with a test, this change fixes?

@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 11, 2026

Absolutely — I already have a fixture for the exact example from the commit message (mixing positional %n$ with sequential %). I’ll push it now as a fixup.

@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 11, 2026

Wow make build didn't notice the missing string parameter, refreshed.

@hakre hakre force-pushed the patch-1 branch 2 times, most recently from 3b1a8e8 to c386aab Compare May 11, 2026 13:26
@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 11, 2026

Tests are green – the phpbench and Infection reds look pre‑existing / unrelated to this change. Let me know if you’d like me to dig into those or if this is good to go. 😊

@hakre hakre force-pushed the patch-1 branch 2 times, most recently from e4293ad to e38254a Compare May 11, 2026 13:55
@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 11, 2026

Created a try snippet so it's better to compare with before and now aligned with the fixture that had another blooper: https://phpstan.org/r/087c4c5f-9a00-47cc-8d36-f21fb1b27a31

Comment thread tests/PHPStan/Analyser/nsrt/sscanf.php Outdated
}

function sscanfInvalidFormatMixingPositionalWithSequential(string $s) {
assertType('null', sscanf($s, '%1$s %s'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be NEVER or Error on PHP8+ since it will throw an error ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But given the fact the return type extension already use TypeCombinator::addNull somewhere else, I'm unsure...

Copy link
Copy Markdown
Contributor Author

@hakre hakre May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Yes, on PHP 8+ it should be never (the call throws, 3v4l.org). I kept null for now as a safe lower bound that doesn’t require a PhpVersion dependency in the extension. If you prefer the stricter version, I can inject PhpVersion and return NeverType for ≥8.0.

/E: found the pattern

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But given the fact the return type extension already use TypeCombinator::addNull somewhere else, I'm unsure...

... me either but I wanted to give it a try and put a fixup on top with php version id test data files (.php scripts) that mimic the same branching as for explode(). The fixup keeps all options open if you prefer something else – let me know.

@staabm
Copy link
Copy Markdown
Contributor

staabm commented May 11, 2026

Initially I thought using the new count helper, we could get rid of a regex path.

Since this is not the case, I think the PR as is does not provide much value.

We would already report a error for the invalid format.
It doesn't bring much value having a more precise return type IMO

@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 11, 2026

Thanks for the honest feedback! I’d like to clarify what this small PR actually fixes, because I think there might be a subtle but important distinction.

The existing rule already reports an error on an invalid format, yes. But the return type extension runs independently—and previously it would still infer array|null for that same invalid format. That false type can then leak into subsequent analysis (e.g., a later $result[0] or a foreach), producing confusing downstream errors that are only indirectly related to the real problem.

This PR adds a minimal gatekeeper that short‑circuits the type inference for any format the counter finds invalid. It returns NullType (or NeverType on PHP 8+), so no false array shape ever enters the type system. That’s exactly the same probe‑based approach that fixed the count regression you already merged.

So the value isn’t “a more precise return type” in isolation—it’s eliminating a class of false‑positive type information that otherwise pollutes later analysis steps. And it does so with a tiny, proven, self‑contained check that doesn’t touch the regex at all.

I’m happy to iterate or, if you’d rather, close this PR and save the gatekeeper for a later round. Just wanted to make sure the motivation is clear. 😊

@staabm
Copy link
Copy Markdown
Contributor

staabm commented May 12, 2026

So the value isn’t “a more precise return type” in isolation—it’s eliminating a class of false‑positive type information that otherwise pollutes later analysis steps.

please show a real-world like snippet on phpstan.org/try which proofs this statement

@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 12, 2026

Fair enough – here’s a real‑world‑like snippet:

function test(string $s): void {
    $result = sscanf($s, '%1$s %s');
    if ($result !== null) {
        echo count($result);
    }
}

Right now, PHPStan infers array|null and silently accepts the if body. With the gatekeeper (and *NEVER* on PHP 8+), it knows the call throws → the echo is dead code. That’s the false‑positive type pollution being removed.

(Note: PrintfParametersRule already reports the invalid format, but the return‑type extension was still leaking array|null separately. The gatekeeper stops that leak.)

Let me know if you’d like the snippet turned into a test fixture – happy to add it.

@hakre hakre force-pushed the patch-1 branch 2 times, most recently from 144d4e0 to 75e26f5 Compare May 13, 2026 13:54
@staabm
Copy link
Copy Markdown
Contributor

staabm commented May 14, 2026

Fair enough – here’s a real‑world‑like snippet:

your example does not contain a false positive as in

producing confusing downstream errors that are only indirectly related to the real problem.

@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 15, 2026

Fair point — I should have said "false-positive type information" rather than "errors", and I stand corrected. The old return type implies the call might succeed; the gatekeeper corrects that.

This PR is ready either way. If you'd like to merge it as the first step before fixing the regex, great; if not, I'll close it and revisit later. Just let me know. 😊

@hakre hakre marked this pull request as draft May 16, 2026 16:38
@hakre hakre force-pushed the patch-1 branch 2 times, most recently from 0161e88 to 89bba3d Compare May 16, 2026 19:40
hakre added 3 commits May 16, 2026 21:50
Use the recently fixed `PrintfHelper::getScanfPlaceholdersCount()` to
detect invalid format strings in the return type extension. If the
format is uncountable, the call is guaranteed to fail – return
`NullType` immediately.

This is the same approach that eliminated the count regression. It
already fixes all false‑positive type inferences for malformed
formats.

For example, invalid format (mixing positional %n$ with sequential %)
returns `null`.

Gegenprobe: the counter doesn’t guess – it asks PHP itself.
Return `NeverType` on PHP 8+ (where an invalid format throws
`ValueError`) and `NullType` on PHP < 8.0. This makes the gatekeeper
precise about the call never returning on modern PHP.

The test strategy for the version‑dependent types will be settled
in review – the CI now whispers `*NEVER*` where `null` stood before,
and `array{}|null` where `array|null` stood before.
Regression cases for the function's dynamic return type extension.

Taken from the earlier PR.

Paint the build red.
hakre added 4 commits May 16, 2026 21:50
In PHP’s sscanf/fscanf, a NUL byte (\0) in the format string terminates
parsing at the C level. The return type extension was not truncating the
format, so placeholders after a NUL could leak into the inferred type.

Truncate the format string at the first NUL byte before matching
conversion specifiers in SscanfFunctionDynamicReturnTypeExtension. The
counter already handles NUL correctly because the runtime sscanf call
stops at the NUL.
Refine return type, refine the test data.

Paint the build red again.
When the format string is empty (e.g. after NUL truncation, as can be
seen on 3v4l.org [1]), sscanf returns an empty array, not null, on all
PHP versions. Return the precise `array{}` type instead of falling
through to the default `array{}|null` signature.

[1]: https://3v4l.org/Xpbu5
No need to run the preg_match_all machinery when we already know
there are no placeholders – the counter has already told us.
hakre added 3 commits May 16, 2026 22:19
Update the return type extension to correctly handle the remaining
specifier patterns found in phpstan's source and testdata corpus, as
well as those reported in issues.

These are in part LLM inferred from the earlier PR, let's see how far
we can recycle them.

Will paint the build red for sure, likely more than once, let's see.
Update the return type extension to correctly handle the remaining
specifier patterns found in phpstan's source and testdata corpus, as
well as those reported in issues.

These are in part LLM inferred from the earlier PR, let's see how far
we can recycle them.

Will paint the build red for sure, likely more than once, let's see.

This looks better to me. Red still is the new green.
Update the return type extension to correctly handle the remaining
specifier patterns found in phpstan's source and testdata corpus, as
well as those reported in issues.

- Expand the regex to cover all valid specifiers (`%i`, `%D`, `%g`, ...)
  and handle length modifiers (`h`, `l`, `L`).
- Fix scanset parsing with `%[...]` to allow `]` as the first character
  inside the set, matching the runtime behavior.
- Eliminate false matches of `%%` using the "Black Magic" recipe
  by NikiC: `(?:%%)+(*SKIP)(*FAIL)|%`.
- Tighten string types to `non-empty-string` for `%s`, `%c`, `%[...]`,
  which always return at least one character on a successful match.
- Map `%u` to `int|non-falsy-string` to account for its overflow
  behavior.

These changes build on the previously merged counter fix and the
empty‑format / NUL‑byte guard.
@hakre
Copy link
Copy Markdown
Contributor Author

hakre commented May 16, 2026

I’ve been thinking about your hope to use the counter to circumvent the regex path. This PR now takes a first concrete step in that direction: the NUL‑byte fix and the builder lift make the counter the foundation for the array shape, and the regex overhaul (applied here) fixes all the known missing‑specifier cases. I’ve also recycled the test cases from the earlier bot‑generated PR to prove the fixes. I’d love to hear if this incremental approach aligns with what you had in mind – the regex is still here for precision, but the counter now guarantees correctness. Happy to adjust as you prefer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants